{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 处理文本数据\n",
    "\n",
    "\n",
    "1. 读取文件内容以及所属的类别\n",
    "2. 提取合适于机器学习的特征向量\n",
    "3. 训练一个线性模型来进行分类\n",
    "4. 使用网格搜索策略找到特征提取组件和分类器的最佳配置"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 从文本文件中提取特征\n",
    "\n",
    "使用 scikit-learn 来对文本进行分词\n",
    "\n",
    "文本的预处理, 分词以及过滤停用词都被包含在一个可以构建特征字典和将文档转换成特征向量的高级组件中"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "count_vect = CountVectorizer()\n",
    "X_train_counts = count_vect.fit_transform(twenty_train.data)\n",
    "X_train_counts.shape\n",
    "#(2257, 35788)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "CountVectorizer 支持单词或者连续字符的 N-gram 模型的计数。 一旦拟合， 向量化程序就会构建一个包含特征索引的字典:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "count_vect.vocabulary_.get(u'algorithm')\n",
    "#4690  在词汇表中一个单词的索引值对应的是该单词在整个训练的文集中出现的频率。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 从出现次数到出现频率"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# tf 和 tf–idf 都可以按照下面的方式计算:\n",
    "from sklearn.feature_extraction.text import TfidfTransformer\n",
    "tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)\n",
    "X_train_tf = tf_transformer.transform(X_train_counts)\n",
    "X_train_tf.shape\n",
    "#(2257, 35788)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在上面的样例代码中，我们首先使用了 fit(..) 方法来拟合对数据的 estimator（估算器），接着使用 transform(..) 方法来把我们的计数矩阵转换成 tf-idf 型。 通过跳过冗余处理，这两步可以结合起来，来更快地得到同样的结果。这种操作可以使用上节提到过的 fit_transform(..) 方法来完成，如下:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tfidf_transformer = TfidfTransformer()\n",
    "X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)\n",
    "X_train_tfidf.shape\n",
    "(2257, 35788)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 训练分类器\n",
    "\n",
    "现在我们有了我们的特征，我们可以训练一个分类器来预测一个帖子所属的类别。 让我们从 朴素贝叶斯 分类器开始. 该分类器为该任务提供了一个好的基准（baseline）. scikit-learn 包含了该分类器的若干变种；最适用在该问题上的变种是多项式分类器:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.naive_bayes import MultinomialNB\n",
    "clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为了尝试预测新文档所属的类别，我们需要使用和之前同样的步骤来抽取特征。 不同之处在于，我们在transformer调用 transform 而不是 fit_transform ，因为这些特征已经在训练集上进行拟合了:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "docs_new = ['God is love', 'OpenGL on the GPU is fast']\n",
    "X_new_counts = count_vect.transform(docs_new)\n",
    "X_new_tfidf = tfidf_transformer.transform(X_new_counts)\n",
    "\n",
    "predicted = clf.predict(X_new_tfidf)\n",
    "\n",
    "for doc, category in zip(docs_new, predicted):\n",
    "    print('%r => %s' % (doc, twenty_train.target_names[category]))\n",
    "...\n",
    "'God is love' => soc.religion.christian\n",
    "'OpenGL on the GPU is fast' => comp.graphics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 构建 Pipeline（管道）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "text_clf = Pipeline([('vect', CountVectorizer()),\n",
    "                      ('tfidf', TfidfTransformer()),\n",
    "                    ('clf', MultinomialNB()),\n",
    "])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "名称 vect, tfidf 和 clf （分类器）都是任意的。 我们将会在下面的网格搜索（grid search）小节中看到它们的用法。 \n",
    "现在我们可以使用下面的一行命令来训练模型:\n",
    "\"\"\"\n",
    "text_clf.fit(twenty_train.data, twenty_train.target) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import datasets\n",
    "iris = datasets.load_iris()\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "gnb = GaussianNB()\n",
    "y_pred = gnb.fit(iris.data, iris.target)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#在测试集上的性能评估，评估模型的预测准确度同样简单:\n",
    "\n",
    "import numpy as np\n",
    "twenty_test = fetch_20newsgroups(subset='test',\n",
    "     categories=categories, shuffle=True, random_state=42)\n",
    "docs_test = twenty_test.data\n",
    "predicted = text_clf.predict(docs_test)\n",
    "np.mean(predicted == twenty_test.target)            \n",
    "#0.834..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### cross_validate 函数和多度量评估\n",
    "\n",
    "return_train_score 默认设置为 True 。 它增加了所有 scorers(得分器) 的训练得分 keys 。如果不需要训练 scores ，则应将其明确设置为 False 。\n",
    "\n",
    "你还可以通过设置return_estimator=True来保留在所有训练集上拟合好的估计器。\n",
    "\n",
    "可以将多个测度指标指定为list，tuple或者是预定义评分器(predefined scorer)的名字的集合"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_validate\n",
    "from sklearn.metrics import recall_score\n",
    "scoring = ['precision_macro', 'recall_macro']\n",
    "clf = svm.SVC(kernel='linear', C=1, random_state=0)\n",
    "scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,\n",
    "                        cv=5)\n",
    "sorted(scores.keys())\n",
    "#['fit_time', 'score_time', 'test_precision_macro', 'test_recall_macro']\n",
    "scores['test_recall_macro']                       \n",
    "#array([0.96..., 1.  ..., 0.96..., 0.96..., 1.        ])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  confusion matrix（混淆矩阵）\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "metrics.confusion_matrix(twenty_test.target, predicted)\n",
    "\n",
    "print(metrics.classification_report(twenty_test.target, predicted,\n",
    "     target_names=twenty_test.target_names)) \n",
    "\n",
    "##precision    recall  f1-score   support"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 调整估计器的超参数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#GridSearchCV 提供的网格搜索从通过 param_grid 参数确定的网格参数值中全面生成候选。例如，下面的 param_grid:\n",
    "param_grid = [\n",
    "  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},\n",
    "  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},\n",
    " ]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "RandomizedSearchCV 实现了对参数的随机搜索, 其中每个设置都是从可能的参数值的分布中进行取样。 这对于穷举搜索有两个主要优势:\n",
    "\n",
    "    1. 可以选择独立于参数个数和可能值的预算\n",
    "    2. 添加不影响性能的参数不会降低效率"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "\"\"\"\n",
    "很明显, 如此地穷举是非常耗时的。 如果我们有多个 CPU，通过设置 n_jobs 参数可以进行并行处理。\n",
    "如果我们将该参数设置为 -1 ， 该方法会使用机器的所有 CPU 核心:\n",
    "\"\"\"\n",
    "gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 模型持久化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 可以通过使用 Python 的内置持久化模块（即 pickle ）将模型保存:\n",
    "from sklearn import svm\n",
    "from sklearn import datasets\n",
    "clf = svm.SVC()\n",
    "iris = datasets.load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "clf.fit(X, y)  \n",
    "\n",
    "\n",
    "import pickle\n",
    "s = pickle.dumps(clf)\n",
    "clf2 = pickle.loads(s)\n",
    "clf2.predict(X[0:1])\n",
    "#在scikit的具体情况下，使用 joblib 替换 pickle（ joblib.dump & joblib.load ）可能会更有趣，这对大数据更有效，但只能序列化 (pickle) 到磁盘而不是字符串变量:"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
